prover: GPU compression path + plumbed (gated) GPU aggregation #3041
gbotrel wants to merge 4 commits into
* Compression (data-availability-v2) auto-enables the gpu/plonk2 prover whenever a CUDA device is reachable. Wall-clock on the reference host drops from ~4:40 (CPU) to ~2:10 per proof.
* Aggregation GPU plumbing (gpu/plonk2 PI/BW6/BN254 + gpu/vortex PI MiMC and ring-SIS + gpu/quotient) is wired but disabled by default behind $LINEA_PROVER_GPU_AGGREGATION; leave the flag off in production for now.
* cmd/controller refuses execution / aggregation / invalidity jobs when a GPU is detected; only compression is accepted on a GPU host.

See prover/reference-benchmarks/README.md for the host class, build command, runtime flags and 3-proof compression reference (avg 2:10.19 on AWS g7e.8xlarge with an RTX PRO 6000 Blackwell).
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #3041 +/- ##
=========================================
Coverage 75.84% 75.84%
Complexity 6844 6844
=========================================
Files 1121 1121
Lines 44508 44508
Branches 5355 5355
=========================================
Hits 33755 33755
Misses 9469 9469
Partials 1284 1284
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- config-mainnet-limitless.toml: restore relative paths (dev-host absolute paths leaked into the committed prod config).
- prover-testing.yml: run `go vet -tags=cuda ./gpu/...` in the static check job so CPU refactors that break GPU compilation are caught. vet compiles but does not link, so no CUDA toolchain is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit c379756.
Adds deterministic byte-level parity checks between the GPU plonk
prover's Fiat-Shamir helpers and the audited gnark CPU construction.
* TestFiatShamirChallengeParity (+ NoBsb22 variant) — replays the four
prover challenges (gamma, beta, alpha, zeta) through the GPU's
bindPublicData/deriveRandomness helpers and compares each derived
fr.Element against an inline reference built directly on
gnark-crypto's public fiat-shamir API. The reference mirrors
gnark CPU's exact bind order from backend/plonk/{curve}/{prove,
verify}.go.
* TestFiatShamirBatchOpenParity — exercises gpuBatchOpen's KZG-folding
FS instance against gnark-crypto's kzg.BatchOpenSinglePoint on
identical synthetic inputs (same polys, digests, claimed values,
point, dataTranscript, SRS, and folding hash). When the gamma
folding challenge matches byte-for-byte, the quotient commitment
H is bit-identical; any FS drift yields a different H.
Generated for bn254, bls12377, bw6761 via the existing template
pipeline. All 9 tests pass locally on RTX PRO 6000 Blackwell.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Summary
Wires GPU acceleration for the compression proof (data-availability-v2) into the production prover and plumbs in, but does not enable, the GPU path for the aggregation proof.
* Compression: the gpu/plonk2 prover is auto-enabled whenever a CUDA device is reachable. Wall-clock on the reference host drops from ~4:40 (CPU) to ~2:10 per proof (3-proof batch average 2:10.19 on AWS `g7e.8xlarge` with one RTX PRO 6000 Blackwell).
* Aggregation: the GPU plumbing is wired (gpu/plonk2 PI/BW6/BN254, gpu/vortex PI MiMC + ring-SIS, gpu/quotient) but off by default. Operators must set `LINEA_PROVER_GPU_AGGREGATION=1` to opt in. Production should leave it off for now.
* Controller: when a GPU is detected, only compression jobs are accepted. Execution / aggregation / invalidity files are ignored even if the corresponding `Enable*` toggles are on, so a GPU host never falls back to a slow CPU path for non-compression work.
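
The path selection summarized above can be sketched as follows. This is a hedged illustration, not the PR's code: `useGPU`, `proofKind`, and the injected `getenv` are hypothetical, the `hasDevice` argument stands in for the real device probe, and treating only the exact value `1` as opt-in is an assumption based on the wording of the description.

```go
package main

import (
	"fmt"
	"os"
)

// gpuAggregationEnv is the opt-in variable named in the PR description.
const gpuAggregationEnv = "LINEA_PROVER_GPU_AGGREGATION"

// proofKind is a hypothetical label for the two proofs discussed here.
type proofKind string

const (
	compression proofKind = "compression"
	aggregation proofKind = "aggregation"
)

// useGPU reports whether a proof of the given kind would take the GPU path.
// getenv is injected so the decision can be tested without touching the
// process environment.
func useGPU(kind proofKind, hasDevice bool, getenv func(string) string) bool {
	if !hasDevice {
		return false // no CUDA device reachable: CPU prover
	}
	switch kind {
	case compression:
		return true // auto-enabled as soon as a device is present
	case aggregation:
		// Assumption: only the literal value "1" opts in.
		return getenv(gpuAggregationEnv) == "1"
	default:
		return false
	}
}

func main() {
	hasDevice := false // assumption: replace with the real device probe
	for _, k := range []proofKind{compression, aggregation} {
		fmt.Printf("%s: gpu=%v\n", k, useGPU(k, hasDevice, os.Getenv))
	}
}
```

Keeping the aggregation gate as an explicit opt-in means that an unset variable and a missing device both resolve to the CPU path.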
1. Build flags
The CPU binary is unchanged. The GPU binary requires the `cuda` build tag and links against the static `libgnark_gpu.a` produced from `prover/gpu/cuda`.
`bin/prover-cuda` is a new make target. The static library `prover/gpu/cuda/build/libgnark_gpu.a` must already exist before linking; the CMake build is unchanged from the existing `gpu/cuda` source tree.
2. Host class per job type
3. Controller behavior on GPU hosts
`cmd/controller` checks `gpu.HasDevice()` at start. When true:
* Only `*-getZkBlobCompressionProof.json` jobs are accepted.
* `EnableExecution`, `EnableAggregation`, `EnableInvalidity` from the config are ignored even if they are `true`.

This is intentional: running the CPU paths on a GPU host with 32 vCPU would be much slower than dispatching them to the existing CPU pool.
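
A minimal sketch of this gating, assuming the controller decides per job file: only the `-getZkBlobCompressionProof.json` suffix comes from the description above, while `acceptJob`, the `cpuPolicy` callback (standing in for the existing `Enable*` logic), and the second sample file name are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// compressionSuffix is the job-file suffix quoted in the PR description.
const compressionSuffix = "-getZkBlobCompressionProof.json"

// acceptJob decides whether a controller picks up a job file.
// On a GPU host only compression jobs pass and cpuPolicy is never consulted,
// which is how the Enable* toggles end up being ignored. On a CPU host the
// decision is delegated unchanged to cpuPolicy.
func acceptJob(file string, hasGPU bool, cpuPolicy func(string) bool) bool {
	if hasGPU {
		return strings.HasSuffix(file, compressionSuffix)
	}
	return cpuPolicy(file)
}

func main() {
	// Hypothetical CPU policy with every toggle switched on.
	allEnabled := func(string) bool { return true }

	files := []string{
		"30388561-30389025-getZkBlobCompressionProof.json",
		"30388561-30389025-someOtherJob.json", // hypothetical non-compression job
	}
	for _, f := range files {
		fmt.Printf("%-52s gpu-host=%v cpu-host=%v\n",
			f, acceptJob(f, true, allEnabled), acceptJob(f, false, allEnabled))
	}
}
```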
CPU controllers are unchanged: same `Enable*` semantics as today.
4. Required runtime env vars
Compression (GPU host):
    GOMEMLIMIT=180GiB GOGC=75   # nothing else; GPU is auto-detected

These two values are baked into the reference run and keep peak Go heap usage at ~200 GiB on a 249 GiB host without thrashing the GC.
Aggregation (CPU host) — unchanged from origin/main today.
5. Required runtime resources
VRAM usage observed: ~80 GiB. Do not schedule another GPU process on the same card while a compression proof is in flight.

`7.1.0/data-availability-v2/` directory must be present on the host. The canonical SRS is read once per process and benefits substantially from being in the OS page cache; a freshly booted host pays ~2 min of cold-cache cost on the first proof. Subsequent runs match the reference numbers below.

staging buffers are reused across rounds and the gnark Go heap is intentionally large under `GOMEMLIMIT=180GiB`).
6. Compression reference numbers (3 sorted requests)
* 30388561-30389025
* 30389026-30389504
* 30389505-30390023

Average wall time: 2:10.19
Per-phase decomposition (from prover logs): solve 33 s → init GPU
instance ~19 s → MSM commit L,R,O ~4 s → build/iFFT/commit Z ~8 s →
quotient GPU ~25 s → MSM h₁,h₂,h₃ ~4 s → eval+linearize+open Z ~7 s →
batch opening ~4 s.
Raw artifacts under `prover/reference-benchmarks/results/2026-05-08-g7e-8xlarge-gpu-compression-final/`.
7. Proof-flow summary
8. Rollback
The compression GPU path can be disabled at runtime by deploying the non-cuda `bin/prover` (or by hiding the GPU device from the prover process, e.g. `CUDA_VISIBLE_DEVICES=""`). No code change is required: the prover falls back to gnark's CPU PlonK prover.
The aggregation GPU path is off by default; there is nothing to roll back unless an operator explicitly set `LINEA_PROVER_GPU_AGGREGATION=1` (just unset it).
Test plan
* `go build ./...` (CPU)
* `go build -tags cuda,debug ./...` (GPU)
* `go test ./gpu/plonk2/... -tags cuda,debug` (per-curve correctness vs gnark CPU reference)
* `go test ./gpu -tags cuda,debug` (device singleton)
* `go test ./circuits/... ./backend/aggregation/... ./cmd/controller/...` (touched packages)
* provertestdata2× 3 sorted requests, all valid
* `g7e.8xlarge` or equivalent)

🤖 Generated with Claude Code
Note
High Risk
High risk because it introduces a new GPU-backed proving path (`gpu/plonk2` via CGO/CUDA) and refactors setup/SRS loading and PI/quotient/vortex hashing logic; mistakes could cause incorrect proofs, runtime failures, or performance regressions across critical proving flows.
Overview
Enables GPU-accelerated proving for data-availability (compression) by threading a new `circuits.WithGPU` option through `ProveCheck`, skipping Lagrange SRS loads when on GPU, and eagerly prefetching setups to reduce wall time.
Plumbs a gated GPU path for aggregation (PI → BW6 → BN254) behind `LINEA_PROVER_GPU_AGGREGATION`, including GPU-backed PI Vortex (MiMC + ring-SIS) and quotient coset reevaluation (CUDA-tagged implementations with CPU fallbacks).
Adds CUDA build tooling and ergonomics: `bin/prover-cuda` make target, CUDA typecheck in CI (`go vet -tags=cuda ./gpu/...`), new `gpu/cuda` CMake build files, and bumps the prover version/dependencies to support these changes.
Reviewed by Cursor Bugbot for commit 066520e. Bugbot is set up for automated code reviews on this repo.